
feat(ingestion): Add -e flag to uv command in ingestion Dockerfiles #10114

Merged · 9 commits into datahub-project:master · Apr 10, 2024

Conversation

@skrydal (Collaborator) commented Mar 22, 2024

Introducing the -e flag to the uv pip install command in the ingestion Dockerfiles, to avoid having to specify the package name in the Dockerfiles.
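A minimal sketch of the idea (hypothetical lines, not the exact contents of the ingestion Dockerfiles): with an editable install, uv derives the package name from the project's own metadata, so the Dockerfile no longer needs to hardcode acryl-datahub.

```dockerfile
# Hypothetical before: the package name is spelled out in the install command,
# so a fork that renames the package must also patch the Dockerfile.
RUN uv pip install --no-cache "acryl-datahub @ file:///datahub-ingestion"

# Hypothetical after: with -e (editable install), uv reads the name from the
# project's setup.py/pyproject.toml, so no package name appears here at all.
RUN uv pip install --no-cache -e /datahub-ingestion
```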

@github-actions bot added labels: ingestion (PR or Issue related to the ingestion of metadata), devops (PR or Issue related to DataHub backend & deployment), community-contribution (PR or Issue raised by member(s) of DataHub Community), datahub-community-champion (PRs authored by DataHub Community Champions) on Mar 22, 2024
@hsheth2 (Collaborator) commented Apr 1, 2024

@skrydal I'm not really a huge fan of this. The way we're setting RELEASE_VERSION already felt like a hack, and this is going to make that even messier.

I would like to understand what changes you're making in your fork of the acryl-datahub and airflow plugin packages. My goal has been to add extension hooks in the main codebase that allow using the main acryl-datahub release, even with forks. For example, this allows you to use custom models: https://datahubproject.io/docs/next/metadata-modeling/extending-the-metadata-model#optional-step-7-use-custom-models-with-the-python-sdk
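For reference, the linked doc describes generating a custom model package that layers model changes on top of the published release, roughly along these lines (paraphrased from the docs, so the exact task name and parameters should be double-checked there):

```bash
# Build a custom model package on top of the main acryl-datahub release;
# the package name and version here are just examples.
./gradlew :metadata-ingestion:customPackageGenerate \
  -Ppackage_name=my-company-datahub-models \
  -Ppackage_version=0.0.1
```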

@skrydal (Collaborator, Author) commented Apr 3, 2024

Hello @hsheth2, thank you for taking the time to review my proposal. As I mentioned in the PR description, I am not a great fan of sed-ing through the source files myself. However, any change that would achieve what we need without using sed would require quite a big refactor of the setup.py file. I am happy to discuss that, but I would prefer to first agree on whether the functionality itself would be accepted.
The problem we are addressing here is not the one you described. We maintain a fork of the entire repository and keep it synced to the (usually) latest release of this repository. We build all the binaries, packages, and Docker images, just as you do in your GHA workflows. What do we do with these artifacts further down the road? In the case of the datahub-ingestion image, we have a Dockerfile based on it containing fine-grained changes and fixes, as well as the code of some custom source connectors (use-case specific). These custom source connectors depend on the acryl-datahub pip package, which is built by us in our internal process from our fork.
Up to this point there is no need to change the name of the pip package we are building (we could even refrain from building it). But we discovered that it is crucial for us to build our own pip package of datahub. Imagine we find a problem in, say, the Redshift connector: we might be able to develop a fix within a couple of hours and keep it in our fork. Of course we are going to contribute it back to this project, but it will take days before it is merged, weeks before it is released, and potentially months before we upgrade to that release (we have had such cases in the past). So in the meantime we simply branch off master in our fork, commit the fix, tag it with a new tag, e.g. 0.13.0+1, and build a new pip package.

If we named that package acryl-datahub, there would be two different version trees of acryl-datahub available to our ingestion Docker image: one coming from our internal artifactory and one coming from the public pip repo to which you publish your artifacts. That would not be a good situation in terms of operations and maintenance. So we use a different name for the pip package committed to our artifactory, say skrydal-datahub. The package our custom source connectors depend on is then not acryl-datahub but skrydal-datahub, so they carry all the fixes we deemed necessary before the next upgrade.
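To make the workflow concrete, a hypothetical sketch (the branch name, tag, package name, and artifactory URL below are placeholders):

```bash
# Branch off the synced master, apply the hotfix, and tag a local version.
git checkout -b redshift-hotfix master
git cherry-pick <fix-commit-sha>
git tag v0.13.0+1

# Build the package under the fork's own name (e.g. skrydal-datahub, swapped
# in via sed today) and publish it to the internal artifactory, so it never
# collides with the public acryl-datahub version tree.
cd metadata-ingestion
python -m build
twine upload --repository-url https://artifactory.internal.example/api/pypi/pypi-local dist/*
```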
Does this shed more light on our problem and on why we need to rename the package? Would you deem such a feature desirable? We could discuss whether there are better ways of doing this than sed.

@hsheth2 (Collaborator) commented Apr 4, 2024

Thanks for the context. A couple of thoughts here:

  • The uv issue of hardcoding the package name in the pip install command (Support URL and path dependencies without a package name, astral-sh/uv#313) has been fixed in recent versions, so we should be able to go back to not using package names in our install commands (see the sketch after this list). Happy to accept a PR for that :)
  • For fast iteration on sources: every PR against the datahub repo gets a "vercel" deployment. You can actually put the deployment URL directly in as the CLI version for testing (see screenshot below).
  • Given that we have no way of testing these sed-related changes, since we don't use them ourselves, we will definitely end up accidentally breaking your workflow as we make tweaks to our deployment process and Dockerfiles. We'd be happy to work with you to structure our code to make forks easier, but it would be a best-effort thing. I don't think we'd accept this PR as is.
[screenshot]
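For the first point above, a sketch of what the install commands could look like once the uv fix is picked up (hypothetical paths and extras):

```bash
# With astral-sh/uv#313 fixed, uv accepts path requirements without an
# explicit package name, both editable and non-editable:
uv pip install /datahub-ingestion
uv pip install -e "/datahub-ingestion[base]"   # editable, with an extra
```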

@skrydal skrydal changed the title feat(ingestion): Custom names for datahub and datahub-airflow-plugin pip packages feat(ingestion): Add -e flag to ingestion Dockerfiles Apr 8, 2024
@skrydal skrydal changed the title feat(ingestion): Add -e flag to ingestion Dockerfiles feat(ingestion): Add -e flag to uv command in ingestion Dockerfiles Apr 8, 2024
@skrydal (Collaborator, Author) commented Apr 8, 2024

@hsheth2 thank you for the explanation. I changed my proposal to simply remove the package name from the Dockerfiles (it worked like this before, and we can fit it into our workflow). Let me know if this PR can be merged :)

@hsheth2 (Collaborator) left a comment

LGTM

Will need to understand why the tests failed though

@skrydal (Collaborator, Author) commented Apr 9, 2024

> LGTM
>
> Will need to understand why the tests failed though

The error message indicates Docker login problems: https://github.com/datahub-project/datahub/actions/runs/8600363235/job/23565456986?pr=10114#step:9:27
This is a typical problem for PRs from forked repos, and it doesn't seem to be related to the changes proposed in this PR.

@hsheth2 (Collaborator) commented Apr 9, 2024

CI flake due to #10235

@hsheth2 hsheth2 merged commit 67b67f7 into datahub-project:master Apr 10, 2024
35 of 36 checks passed
sleeperdeep pushed a commit to sleeperdeep/datahub that referenced this pull request Jun 25, 2024